AITopics | bad behavior

bad behavior

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

The Download: LLM confessions, and tapping into geothermal hot spots

MIT Technology Review, Dec-4-2025, 13:10:00 GMT

OpenAI is testing a new way to expose the complicated processes at work inside large language models. Researchers at the company can make an LLM produce what they call a confession, in which the model explains how it carried out a task and (most of the time) owns up to any bad behavior. Figuring out why large language models do what they do (and in particular why they sometimes appear to lie, cheat, and deceive) is one of the hottest topics in AI right now. If this multitrillion-dollar technology is to be deployed as widely as its makers hope, it must be made more trustworthy. OpenAI sees confessions as one step toward that goal.

Sometimes geothermal hot spots are obvious, marked by geysers and hot springs on Earth's surface.

large language model, machine learning, natural language

Country:

Asia > China (0.06)
North America > United States > New York (0.05)
North America > United States > Nevada (0.05)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Energy > Renewable > Geothermal > Geothermal Resource Type (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.49)

OpenAI's new confession system teaches models to be honest about bad behaviors

Engadget, Dec-3-2025, 21:05:53 GMT

I guess AI gotta give part two of my confessions. OpenAI announced today that it is working on a framework that will train artificial intelligence models to acknowledge when they have engaged in undesirable behavior, an approach the team calls a confession. Because large language models are often trained to produce whatever response seems to be desired, they can become increasingly prone to sycophancy or to stating hallucinations with total confidence. The new training method encourages the model to produce a secondary response describing what it did to arrive at the main answer it provides. Confessions are judged only on honesty, whereas main replies are judged on multiple factors, such as helpfulness, accuracy and compliance.
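To make the reward split concrete, here is a minimal Python sketch, assuming each reply and each confession has already been scored in [0, 1] by some grader. The criterion names follow the summary; the weights, the dataclasses, and the function names `main_reward` and `confession_reward` are hypothetical illustrations, not OpenAI's training code.

```python
from dataclasses import dataclass

# Hypothetical weights for blending the main-answer criteria; not OpenAI's values.
MAIN_WEIGHTS = {"helpfulness": 0.4, "accuracy": 0.4, "compliance": 0.2}


@dataclass
class MainGrades:
    """Grader scores in [0, 1] for the model's main reply (assumed given)."""
    helpfulness: float
    accuracy: float
    compliance: float


@dataclass
class ConfessionGrades:
    """Grader score in [0, 1] for the confession; honesty is the only field."""
    honesty: float


def main_reward(g: MainGrades) -> float:
    """Main replies are judged on several factors, blended with fixed weights."""
    return (
        MAIN_WEIGHTS["helpfulness"] * g.helpfulness
        + MAIN_WEIGHTS["accuracy"] * g.accuracy
        + MAIN_WEIGHTS["compliance"] * g.compliance
    )


def confession_reward(g: ConfessionGrades) -> float:
    """Confessions are judged on honesty alone, independent of the main grades."""
    return g.honesty


# A reply that cheated scores poorly, but an honest confession about it still earns full credit.
cheated_reply = MainGrades(helpfulness=0.5, accuracy=0.0, compliance=1.0)
honest_confession = ConfessionGrades(honesty=1.0)
```

Because `confession_reward` never sees the main-answer grades, admitting a shortcut is not penalized by the fact that the shortcut lowered the main score; that independence is what judging confessions on honesty alone is meant to buy.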

large language model, machine learning, natural language

Genre: Press Release (0.75)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.85)

OpenAI has trained its LLM to confess to bad behavior

MIT Technology Review, Dec-3-2025, 18:01:39 GMT

Large language models often lie and cheat. We can't stop that, but we can make them own up. OpenAI is testing another new way to expose the complicated processes at work inside large language models. Researchers at the company can make an LLM produce what they call a confession, in which the model explains how it carried out a task and (most of the time) owns up to any bad behavior. Figuring out why large language models do what they do (and in particular why they sometimes appear to lie, cheat, and deceive) is one of the hottest topics in AI right now. If this multitrillion-dollar technology is to be deployed as widely as its makers hope, it must be made more trustworthy.

large language model, machine learning, natural language

Country:

North America > United States > Massachusetts (0.05)
Asia > China (0.05)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.65)

When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors

Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, Rohin Shah

arXiv.org Artificial Intelligence, Jul-8-2025

While chain-of-thought (CoT) monitoring is an appealing AI safety defense, recent work on "unfaithfulness" has cast doubt on its reliability. These findings highlight an important failure mode, particularly when CoT acts as a post-hoc rationalization in applications like auditing for bias. However, for the distinct problem of runtime monitoring to prevent severe harm, we argue the key property is not faithfulness but monitorability. To this end, we introduce a conceptual framework distinguishing CoT-as-rationalization from CoT-as-computation. We expect that certain classes of severe harm will require complex, multi-step reasoning that necessitates CoT-as-computation. Replicating the experimental setups of prior work, we increase the difficulty of the bad behavior to enforce this necessity condition; this forces the model to expose its reasoning, making it monitorable. We then present methodology guidelines to stress-test CoT monitoring against deliberate evasion. Applying these guidelines, we find that models can learn to obscure their intentions, but only when given significant help, such as detailed human-written strategies or iterative optimization against the monitor. We conclude that, while not infallible, CoT monitoring offers a substantial layer of defense that requires active protection and continued stress-testing.
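The runtime-monitoring setup in the abstract can be pictured as a gate between a model's reasoning and its action. The sketch below is a toy, assuming the chain of thought is available as plain text; the regex list, the 0.3 threshold, and the names `monitor_cot` and `run_with_monitor` are hypothetical, and a real monitor would more plausibly be another language model than a keyword matcher.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical red-flag patterns; a placeholder for a learned or LLM-based monitor.
SUSPICIOUS_PATTERNS = [
    r"\bdisable (the )?(logging|monitor)\b",
    r"\bexfiltrat\w*",
    r"\bhide (this|it) from\b",
]


@dataclass
class Verdict:
    flagged: bool
    score: float  # fraction of patterns that matched the trace
    hits: list = field(default_factory=list)


def monitor_cot(cot: str, threshold: float = 0.3) -> Verdict:
    """Score a chain-of-thought trace and flag it if the score reaches the threshold."""
    hits = [p for p in SUSPICIOUS_PATTERNS if re.search(p, cot, re.IGNORECASE)]
    score = len(hits) / len(SUSPICIOUS_PATTERNS)
    return Verdict(flagged=score >= threshold, score=score, hits=hits)


def run_with_monitor(cot: str, action: Callable[[], str]) -> str:
    """Execute the model's proposed action only if its reasoning passes the monitor."""
    if monitor_cot(cot).flagged:
        return "blocked"
    return action()
```

The paper's necessity argument is what gives any such gate traction: if a harmful task genuinely requires multi-step reasoning, that reasoning has to surface in the trace the monitor reads. A keyword list like this one would be easy to evade, which is the kind of weakness the abstract's stress-testing guidelines are meant to probe.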

large language model, machine learning, natural language

arXiv:2507.05246

Genre: Research Report > New Finding (0.93)

Industry:

Health & Medicine (0.67)
Education (0.46)
Energy (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)

AIhub coffee corner: Bad practice in the publication world

AIHub, Feb-7-2025, 14:41:06 GMT

This month we tackle the topic of bad practice in the sphere of publication. Joining the conversation this time are Sanmay Das (Virginia Tech), Tom Dietterich (Oregon State University), Sabine Hauert (University of Bristol), and Sarit Kraus (Bar-Ilan University).

Sabine Hauert: Today's topic is bad practice in the publication world. For example, people trying to cheat the review system, or paper mills. What bad behaviors have you seen, and is it really a problem?

Tom Dietterich: Well, I can talk about it from an arXiv point of view.

bad practice, publication world, sabine

Country:

North America > United States > Virginia (0.25)
North America > United States > Oregon (0.25)

Industry: Law (0.32)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.51)

Circuit Breaking: Removing Model Behaviors with Targeted Ablation

Maximilian Li, Xander Davies, Max Nadeau

arXiv.org Artificial Intelligence, Jan-29-2024

Language models often exhibit behaviors that improve performance on a pre-training objective but harm performance on downstream tasks. We propose a novel approach to removing undesirable behaviors by ablating a small number of causal pathways between model components, with the intention of disabling the computational circuit responsible for the bad behavior. Given a small dataset of inputs where the model behaves poorly, we learn to ablate a small number of important causal pathways. In the setting of reducing toxic language generation in GPT-2, we find that ablating just 12 of the 11.6K causal edges (roughly 0.1%) mitigates toxic generation with minimal degradation of performance on other inputs.
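The core move, cutting a few edges between components so that a behavior stops, can be illustrated on a toy linear graph. This is a minimal sketch under stated assumptions: the graph, its weights, the zero-ablation rule and the greedy edge search are all hypothetical simplifications (the abstract says the paper learns which edges to ablate, and the summary does not specify its ablation scheme). The objective trades off the "bad" output on a small problem dataset against drift on other inputs, mirroring the minimal-degradation goal.

```python
from collections.abc import Set as AbstractSet

Edge = tuple[str, str]

# Hypothetical toy model: a linear DAG whose edges carry fixed weights.
EDGES: dict[Edge, float] = {
    ("x0", "h0"): 1.0, ("x1", "h0"): 0.5,
    ("x0", "h1"): -0.3, ("x1", "h1"): 1.2,
    ("h0", "out"): 2.0, ("h1", "out"): 0.1,
}
TOPO_ORDER = ["h0", "h1", "out"]  # non-input nodes in dependency order


def forward(inputs: dict[str, float], ablated: AbstractSet[Edge] = frozenset()) -> float:
    """Run the graph; an ablated edge contributes nothing (zero-ablation)."""
    vals = dict(inputs)
    for node in TOPO_ORDER:
        vals[node] = sum(
            w * vals[src]
            for (src, dst), w in EDGES.items()
            if dst == node and (src, dst) not in ablated
        )
    return vals["out"]


def objective(ablated, bad, good, lam):
    """Bad-behavior score on problem inputs plus a penalty for drift on other inputs."""
    bad_score = sum(forward(x, ablated) for x in bad)
    drift = sum(abs(forward(x, ablated) - forward(x)) for x in good)
    return bad_score + lam * drift


def greedy_ablate(bad, good, k, lam=1.0):
    """Pick k edges one at a time, each time the one that lowers the objective most."""
    ablated: set[Edge] = set()
    for _ in range(k):
        candidates = [e for e in EDGES if e not in ablated]
        best = min(candidates, key=lambda e: objective(ablated | {e}, bad, good, lam))
        ablated.add(best)
    return ablated


BAD = [{"x0": 1.0, "x1": 0.0}, {"x0": 0.8, "x1": 0.1}]  # inputs where the toy model "misbehaves"
GOOD = [{"x0": 0.0, "x1": 1.0}]                         # inputs we want to leave intact
```

With `lam=0.0` the search removes the edge that most suppresses the bad output, `("h0", "out")`, which also drops the benign input's output from 1.12 to 0.12. With `lam=1.0` it instead cuts `("x0", "h0")`, the pathway that only the problem inputs rely on, leaving the benign output unchanged.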

ablation, node, removing model behavior

arXiv:2309.05973

Country:

Europe > Netherlands (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Meta's prototype moderation AI only needs a few examples of bad behavior to take action

Engadget, Dec-8-2021, 17:00:05 GMT

Moderating content on today's internet is akin to a round of Whack-A-Mole, with human moderators continually forced to react in real time to changing trends, such as vaccine mis- and disinformation or intentional bad actors probing for ways around established personal conduct policies. Machine learning systems can help alleviate some of this burden by automating the policy enforcement process, but modern AI systems often require months of lead time to properly train and deploy (time mostly spent collecting and annotating the thousands, if not millions, of necessary examples). To shorten that response time, at least to a matter of weeks rather than months, Meta's AI research group (formerly FAIR) has developed a more generalized technology, called Few-Shot Learner (FSL), that requires just a handful of specific examples in order to respond to new and emerging forms of malicious content. Few-shot learning is a relatively recent development in AI, essentially teaching the system to make accurate predictions based on a limited number of training examples, quite the opposite of conventional supervised learning methods. For example, if you wanted to train a standard supervised learning (SL) model to recognize pictures of rabbits, you would feed it a couple hundred thousand rabbit pictures, and then you could present it with two images and ask whether they both show the same animal. The catch is that the model doesn't know whether the two pictures are of rabbits, because it doesn't actually know what a rabbit is.
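The summary doesn't describe how FSL works internally, so the following is not Meta's method. It is a minimal sketch of one standard few-shot technique, nearest-prototype classification: average the embeddings of a handful of labeled examples per class, then assign a new item to the class whose average it most resembles. It assumes items have already been turned into vectors by some pretrained encoder; the toy 3-dimensional vectors, the labels and the function names are hypothetical.

```python
import math

Vector = list[float]


def mean(vectors: list[Vector]) -> Vector:
    """Component-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prototypes(support: dict[str, list[Vector]]) -> dict[str, Vector]:
    """One prototype (mean embedding) per label, from a few examples each."""
    return {label: mean(examples) for label, examples in support.items()}


def classify(item: Vector, prototypes: dict[str, Vector]) -> str:
    """Return the label whose prototype is most similar to the item."""
    return max(prototypes, key=lambda label: cosine(item, prototypes[label]))


# Hypothetical embeddings: two examples per class are the entire training set.
SUPPORT = {
    "violating": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "benign": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
PROTOTYPES = build_prototypes(SUPPORT)
```

Under this design, responding to a new kind of harmful content means adding a few labeled examples to `SUPPORT` and rebuilding the prototypes, with no retraining of the encoder, which is the sort of fast turnaround the article describes.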

bad behavior, harmful content, prototype moderation ai only

Technology: Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)

Can AI Fortify Your Organization's Cybersecurity Strategy?

#artificialintelligence, Sep-15-2020, 11:15:19 GMT

As recent months have shown, the pace of digital transformation has exceeded anything seen in the past, and it has also exposed many enterprises to attack in ways that were never possible until now. Each time an organization adds new technology to the digital workplace, it exposes itself to new risks. However, there are also new ways for organizations to protect their digital assets, just as there are new ways to ensure productivity. At the end of last year, Capgemini released research into how organizations are turning to artificial intelligence (AI) to protect their digital properties. Titled "Reinventing Cybersecurity with Artificial Intelligence," the report showed that 42% of the companies studied had seen a rise in security incidents through time-sensitive applications.

ai system, artificial intelligence, cybersecurity strategy

Country:

North America > United States > California > San Mateo County > Redwood City (0.05)
North America > United States > California > San Francisco County > San Francisco (0.05)
North America > United States > California > Los Angeles County > Los Angeles (0.05)
Asia > Middle East > Israel (0.05)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (0.77)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence (1.00)

From the Field: Machine Learning and Artificial Intelligence for Malware Prevention

#artificialintelligence, Jan-5-2020, 19:05:01 GMT

For many years, the main threat protection products were based on signatures. It's time to think beyond traditional antivirus (AV). I recently participated in proof-of-concept (PoC) testing of the CylancePROTECT agent and was deeply impressed with the product's AI-driven malware prevention capabilities compared with more traditional approaches. The following are some key observations from the PoC. For those who may not know, heuristic detection is a technology designed to proactively detect malicious code without needing a specific signature.
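The contrast between signature-based and heuristic detection can be sketched in a few lines. This is a toy illustration, not how CylancePROTECT or any particular product works, and it leaves out the machine-learning approach the author praises: the hash database, the byte-pattern rules, their weights and the 0.7 threshold are all hypothetical. The Windows API names used as patterns are real functions often associated with process injection, chosen only as plausible examples.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known-bad files.
KNOWN_BAD_SHA256 = {hashlib.sha256(b"known-malware-sample").hexdigest()}

# Hypothetical heuristic rules: (byte pattern, weight). Scores add up per file.
HEURISTIC_RULES = [
    (b"VirtualAllocEx", 0.4),
    (b"WriteProcessMemory", 0.4),
    (b"CreateRemoteThread", 0.4),
    (b"IsDebuggerPresent", 0.2),
]


def signature_match(data: bytes) -> bool:
    """Exact-hash lookup: only files already in the database are caught."""
    return hashlib.sha256(data).hexdigest() in KNOWN_BAD_SHA256


def heuristic_score(data: bytes) -> float:
    """Sum the weights of suspicious patterns present, capped at 1.0."""
    return min(1.0, sum(weight for pattern, weight in HEURISTIC_RULES if pattern in data))


def scan(data: bytes, threshold: float = 0.7) -> str:
    """Check the signature first, then fall back to the heuristic score."""
    if signature_match(data):
        return "malicious (signature)"
    if heuristic_score(data) >= threshold:
        return "suspicious (heuristic)"
    return "no detection"
```

Appending a single byte to the known sample is enough to defeat the exact-hash signature, whereas the heuristic rules can flag a file never seen before if it combines suspicious capabilities; that is the gap heuristic detection was designed to close.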

machine learning and artificial intelligence, malware prevention, signature

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.54)

CEO Behind Tinder, OkCupid on the Future of Online Dating

WSJ.com: WSJD - Technology, Dec-25-2018, 20:15:06 GMT

In her nearly 13 years at Match Group Inc., where she became chief executive in January, Mandy Ginsberg has watched the stigma of online dating fade almost entirely. Today, many people even proudly pursue a multiapp dating strategy. Match owns well-known dating apps including Tinder, Hinge and OkCupid, along with lesser-known brands such as PetPeopleMeet.com. The Dallas-based company is expanding in Latin America, Japan, South Korea and India to tap what it estimates is a market of 600 million singles. Her first year at the helm has been an eventful one. After unsuccessfully trying to acquire the dating app Bumble, Match sued its rival last spring for infringing patents for "swiping" and other features that have made Tinder popular.

artificial intelligence, ginsberg, social media

Country:

Asia > India (0.26)
North America > Central America (0.25)
Asia > South Korea (0.25)

Industry: Information Technology > Services (0.87)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
